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Abstract 

We derive approximation algorithms for the nonnegative matrix factorization 
problem, i.e. the problem of factorizing a matrix as the product of two matrices 
with nonnegative coefficients. We form convex approximations of this problem 
which can be solved efficiently and test our algorithms on some classic numerical 
examples. 



1 Introduction 



Nonnegative matrix factorization (NMF) is a classic unsupervised technique to learn a parts-based 
representation of the data in an additive setting. As such, it is used as a factor analysis tool for 
high-dimensional data whose components are constrained to be nonnegative. Given a data matrix 
A E R mx ™, we write the nonnegative matrix factorization problem as: 

minimize 1oss(j4, UV t ) 

subject to U,V>0, ( ) 

in the variables U £ R mxfc and V <E R nxk 7 where loss(X, Y) is a loss function, and k is a given 
rank target. This apparently simple problem can be traced back to JT] and 12 and has found many 
applications in machine learning and statistics. It was used for example in gene expression data 
analysis (see e.g. Q), in signal processing (see flU), as a clustering tool (see e.g. and 0), for 
image analysis (see [7|), etc. If k > min(m, n), A = AI is always a solution, so our objective 
is to make this representation as parsimonious as possible and keep k small. The decomposition 
is of course not unique and this and other consistency issues were explored in 0. As a factor 
analysis technique, NMF is very similar to Principal Component Analysis, however PCA amounts 
to a simple singular value decomposition of the data matrix which is computationally easy, while 
NMF is a NP-Hard problem. Furthermore, in all of these applications, the nonnegativity constraint 
on the components is the result of some physical property of the data, hence cannot be lifted. 



1.1 Current methods 

Here, we briefly summarize the main types of algorithms currently used to solve it. We refer the 
reader to 10 for a more complete survey. The algorithms listed below have all been implemented in 
MATLAB libraries such as NMFLAB by or NMFPACK by IflOl. 

Multiplicative update. The original algorithm in [2 |, when the loss is given by the Mean Squared 
Error, updates the current iterates U and V as follows: 

(U T A) 31 „„, TT+ _ TT {AV)a 



VP, 



and Ul 



Ui, 



(U T UV T ) 3l » lJ {UV T V) tJ ' 

Similar updates exist for other loss functions. This is a descent method but, in this form, it is not 
guaranteed to converge to a local minimum. 
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Gradient descent. Another set of algorithms (see ifTTl or ifTUl among others) directly apply a 
gradient descent algorithm to problem (Q~|). 

Alternating least squares. When the loss function is given by the MSE, a third group of algo- 
rithms take advantage of the fact that the minimization problem in only one of the variables is 
equivalent to a least-squares problems. These block-coordinate descent algorithms (see [ 12 | among 
others) minimize the loss by alternating between the LS problems in U and V. 

Recently, IfTUl (see also 0~2]) also added a penalty term to problem (Q~|i to make the matrices U and 
V sparse, which improves interpretability in imaging applications for example. 

1.2 Contribution 

All the algorithms above have one common characteristic, they all solve a nonconvex formulation 
of the nonnegative matrix factorization problem. In particular, this means that they all seek local 
solutions to the original problem. This creates stability issues, i.e. the solutions are very sensitive 
to the choice of initial point, it also creates complexity issues, meaning that no precise bound can be 
given on the computing time required to solve the problem and suboptimality cannot be measured by 
computing the duality gap. Here, we begin by formulating a convex approximation to nonnegative 
matrix factorization, we then solve the approximate problem using convex optimization methods. 
This means that we find global (hence potentially more stable) solutions to the approximate problem 
with guaranteed complexity bounds. 

In the symmetric case, we first show that the NMF problem can be formulated as the problem of 
approximating a given matrix by a completely positive matrix. We then use a convex representation 
of a restriction of the set of completely positive matrices to write a convex restriction of the sym- 
metric NMF problem. In other words, we show that solving problem (fl]i over a subset of all the 
possible choices of U and V is equivalent to a convex problem. We then extend these results to the 
nonsymmetric case. 

The paper is organized as follows, in Section|2]we detail our convex approximations of problem (fl} 
in both the symmetric and nonsymmetric case. We present some simple algorithms in Section [3] 
Finally, in Section|4j we compare our methods with existing algorithms in numerical examples. 

2 Convex approximations 

In this section we derive a convex approximation to problem (Q]). We first discuss the case where the 
matrix A is symmetric, then generalize our results to nonsymmetric matrices. 

2.1 Symmetric case 

In this section, given a data matrix A € S n , we focus on the following symmetric nonnegative matrix 
factorization problem: 

minimize loss(A, UU T ) 

subject to U > 0, ( ' 

in the variable U G R™ xfe . In this paper, we consider two classical choices for the loss function 
given by: 

Mean Squared Error (MSE): loss(X,F) = ||X-y|| 2 

Kullback-Leibler (KL): loss(X, Y) = £" J=1 (X tJ \og(X tJ /Y tJ ) + Y l3 - X i5 ). 

Both losses are convex functions of either X or Y but are not jointly in convex in (X, Y), which 
means ([]]) is a nonconvex problem. 

2.1.1 Completely positive matrices 

The solutions to this symmetric NMF problem, i.e. the symmetric matrices A 6 S™ which can be 
written in the form: 

A = UU T , U > (3) 
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where U £ R" xfe , are called completely positive (see |[T3l for a complete discussion). This can 

also be written A = Yl%=i u i u J with Ui > and the smallest k for which this representation holds 
is called the cp-rank of A. This provides us with another interpretation of problem (|2): it seeks 
the closest completely positive matrix to a given a matrix A. It also illustrates the difficulty of this 
problem: the set of completely positive matrices forms a cone whose dual is the cone of copositive 
matrices and testing for the copositivity of a matrix is a well-known NP-Hard problem. Also, |[T4]l 
shows that any continuous or binary nonconvex program can be written as a linear program over the 
cone of completely positive matrices. The fundamental result we use in this section is given by the 
following theorem. 

Theorem 1 (Th. 2.30 in |13|) If X £ S™ is positive semidefinite, then exp H (X), the Hadamard 
(or componentwise) exponential of X, is completely positive. 

This means that after a natural change of variables Aij — exp(Xy), we get a sufficient, convex 
condition on X for representation (0 to hold. This also shows that kernel matrices obtained by 
negative exponentiation of negative semidefinite distance matrices are completely positive, hence 
can be interpreted as linear kernels over nonnegative feature vectors. 



2.1.2 Convex restriction 

The above result allows us to form a convex restriction of the symmetric NMF problem in (O as: 

minimize 1oss(j4, exp H (X)) ... 
subject to X y 0, { > 

in the variable X £ S™. When the loss function is given by the KL divergence, this problem 
becomes: 

minimize Y,ij=i A ij{ io &( A ij) ~ x ij) + exp(Xy) - A i3 - 

subject to X y 0, ( ' 

which is a convex optimization problem in the variable X 6 S™. When the loss is given by the 
MSE, the objective function is not convex in X unless we impose the additional constraint that 
< exp(Xy). Problem (0 then becomes: 

minimize Yn,j=i (exppQj ) - A rj ) 2 

subject to Aij/2 < exp(Xy), i,j = l,...,n (6) 
XhQ, 

which is also a convex problem in the variable X £ S™. 



2.1.3 Factorizing cxp H (X) when X is positive semidefinite 

We know from Theorem[T]that exp H (X) is completely positive when X is positive semidefinite so 
there is a matrix L such that 

exp H (X) = UU T , 

where U £ j(j ixk . First, Caratheodory's theorem allows us to bound the size of U (i.e. the cp-rank 
of exp H (X)), and Theorem 3.5 in (T3| shows that we can get: 

r(r + l) 

k < -^-t- - 1, (7) 

where r = Rank(exp ff (X)). Also, the Hadamard (or componentwise) product of two com- 
pletely positive matrices is completely positive: suppose A = 5Zi=i a i a I an d B = XL=i bibj 
with ai, hi > 0, then: 

k I 
i=l j=l 

hence the Hadamard product A o B is completely positive as the sum of nonnegative rank one 
matrices. Now, because X is positive semidefinite, we can write: 



exp H (X) =Y[exp H (A 



i=l 
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where the matrix product is understood componentwise and Xi £ R™ , which means that exp H (X) 
can be written as the Hadamard product of matrices of the type exp H (vv T ). As in [fl~3| Theorem 
2.30, we let M — max i=1 n then 

exp H (vv T )ij = exp(viVj) — exp(— M 2 + (M + Xi){M + Xj) — M(xt + xj)), 

so 

exp H (vv T ) = exp(-M 2 ) exp H (yy T ) o zz T , 
where y = Ml + v and z — exp H (—Mv) are both nonnegative vectors. Because y is nonnegative 
and 

i=l 

with (yy T )°j — (yiyj) k , the matrix exp H (yy T ) is completely positive. Hence, as the Hadamard 
product of two completely positive matrices, exp H (vv T ) is completely positive. To summarize, 
when X is positive semidefinite, we factorize the matrix exp H (X) as follows: 



Factorizing exp H (X) 

1. Compute the eigenvalue decomposition: X = Y^i=i ^i x i x J ■ 

2. Decompose each factor, exp H (vivf) — exp(—M 2 ) exp H (yiyf)o Zizf where Vi = \f\xi 
and yi, Zi are nonnegative vectors. 

3. Approximate exp H (y t yf) as X)* =1 (yy T ) ok /k\. 

4. Collect all the terms above using the chain rule in © to get exp H (X) = UU T . 



Without any further processing, the size of U can quickly become very large. Because of the bound 
given in (0, we know however that the number of columns of U is bounded above by r(r + l)/2 — 1 
where r = Rank(exp H (X)) and we can use this result to simplify the decomposition, but this 
is numerically costly and typically unnecessary. In practice, the eigenvalues of cxp H (vivf) are 
decreasing exponentially fast and we can use the fact that when X £ S n is a positive semidefinite 
matrix with nonnegative coefficients and Rank(X) = 2, then X is completely positive. We then 
replace exp H (vivf) by a rank two approximation which, if it is nonnegative, means that the size k 
of the matrix U £ ~R nxk is given by k = 2 r where r is the number of significant factors in X, which 
is typically small. The precision of that approximation is further studied in Section|4] 

2.2 Sparse decomposition 

As suggested in [ 1 1 (see also |[T2l ). e.g. when the matrix A itself is sparse, we look for a sparse 
decomposition A = UU T , i.e. a decomposition where the matrices U are sparse. In that case, the 
change of variable A — exp(Xij) is not appropriate. We can however look directly for a low rank 
decomposition by exploiting the property detailed above that when X £ S™ is a positive semidefinite 
matrix with nonnegative coefficients and Rank(X) = 2, then X is completely positive. We then 
solve: 

minimize \\A - X\\ 2 + -y\X\ + v Tr(X) 

subject to X > 0, X > 0, { ' 

in the variable X £ S", where |Jf| = Y^j=i l^y|> 7 is a parameter controlling the sparsity of X 
and v is a penalty on its rank (see lfl5l for details). 

2.3 Recursive decomposition 

Because the condition in Theorem[T]is only sufficient, problems (f5]) and ((6]) only cover a subset of 
all the possible nonnegative factorizations of the data matrix A. To overcome this limitation, we can 
solve problem (f2]i recursively, setting Aq = A and 

A k+1 =A k - U k Ul, 

where Uk is the solution to the factorization problem given A k . To ensure, that the intermediate 
matrices A k remain nonnegative, we can simply add the (convex) constraint that UkU k < A to 
problems (f5J and ©. 
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2.4 Nonsymmetric case 



We now extend the results of the previous section to the symmetric case. Given a data matrix 
A e R" IX '\ we write the nonnegative matrix factorization problem as: 

minimize 1oss(j4, UV t ) , lm 
subject to U,V>0, ( ' 

in the variables U € R mxfc and V <E R nxfc ^ w here loss(X, Y) is one of the loss functions given 
above. Because the matrix A has a nonnegative factorization if and only if there are matrices B,C £ 
S" such that the symmetric block-matrix: 

B A 
A T C 

is completely positive. Of course any nonnegative matrix can be factorized as the product of two 
nonnegative matrices because A — AI is always a solution. Our objective here is to make this 
representation as parsimonious as possible and find a solution with minimum cp-rank. We can't 
minimize the cp-rank of the decomposition directly without making the problem nonconvex, how- 
ever the bound in (Q shows that we can use the rank of a matrix as a proxy for its cp-rank. We know 
from 1 15| that when X 6 R mx ™^ ||X||*, the trace norm of X is the largest convex lower bound on 
Hank(X). We also know that < t if and only if there are symmetric matrices Y, Z such that: 

£t X z ) h and Tr(Y) + Tr(Z) < 2t. 

The problem of finding a low cp-rank nonnegative factorization of a matrix A E R mx " can th en be 
written: 

minimize loss(A, X) + 7(Tr(F) + Tr(Z)) 

( Y X \ , , (11) 
subject to I g I completely positive, 

in the variables X e R mxn , y g S m and Z e S™, where 7 > controls the rank of the solution. 
This is now a symmetric NMF problem and can be solved using the results detailed in the previous 
sections. 



3 Algorithms 

The results in © and (O show that the symmetric NMF problem can be approximated by convex 
problems for which efficient, globally convergent algorithms are available. Here, because our focus 
is on solving large-scale problems with a relatively low precision, we use simple first-order methods 
to solve the optimization problems detailed in Section|2] 

3.1 Projected gradient method 

The simplest of these algorithms is the projected gradient method. Suppose that we need to minimize 
a convex function f(x) over a convex set C. The projected gradient algorithms works as follows: 



Projected Gradient Method 

1. Start from a point x € R™. 

2. Compute Xk+i = Pc{xk + V/(xfc)), where pc{x) is the projection of x on the set C. 

3. Repeat step 2 until precision target is reached. 



Applying this method to problem (0 for example, to solve: 

minimize J27j=i ^y( lo g(^ij) ~ X v) + exp(Xij) - A tj 
subject to X >z 0, 

we get pc explicitly as pc{X) = X + , where the matrix X + is obtained by zeroing out the negative 
eigenvalues of the matrix X. 
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3.2 Complexity 



Provided some smoothness assumptions on the objective function, the number of iterations of the 
generic projected gradient method grows as 0(l/e 2 ) where e is the target precision. 

3.3 Convergence and duality gap 

The dual of problem © is given by: 

maxmimize E"j=i A ij + Y *3 ~ ( A ij + Y ij) lo s( A ij + Y ij) (m 
subject to Y y 0, 

in the variable Y £ S™. The optimality conditions impose: 

Y = cxv H (X)-AhO, 

where X, Y are primal and dual solutions to (0. This means that if X is the current primal point, 
then (exp H (X) — A) + is a dual feasible point which can be used to compute a duality gap, measure 
the optimality of X and track convergence. 



4 Numerical Results 

In this section, we test the performance of our algorithm for solving the symmetric NMF problem 
on graph partitioning problems. 

4.1 Graph partitioning using NMF 

Let A £ {0,l} nx ™be the adjacency matrix of a given graph G, with 

. _ f 1 if (i, j) is an edge of G 
lJ \ otherwise. 

Suppose that we want to partition this graph into k clusters Ck while minimizing the number of 
graph edges between clusters and maximizing the number of graph edges inside clusters. For a 
given partition C of a graph with adjacency matrix A £ {0, 1}™ X " ; the performance measure we 
use here is given by: 

#{(i,i)l^£? =1 *tf*ii} 

perf(C) = 1 ^ - 2 >- (13) 



where X £ {0, l}" xfc is an indicator matrix such that: 

Xik = 



1 if node i is in cluster Ck 
otherwise, 

which satisfies XI = 1. The graph partitioning problem can then be formulated as: 

minimize 1 1 A - XX T \ \ 2 
subject to XI = 1, 



(14) 



in the variables X £ {0, l}" xfe . This is a (hard) binary optimization problem, which we relax into 
a symmetric nonnegative matrix factorization problem as in |[T6l . to get: 

minimize ||j4 — XX T \\ 2 

subject to X>0, { ' 

in the variable X £ R™ xfc ? which can be solved using the algorithms detailed in Section [3] We 
can then turn the solution X of this relaxation into an indicator matrix by setting the maximum 
coefficient of each row to one and all the others to zero. 
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4.2 Partitioning performance 



To test the performance of our algorithms for symmetric NMF on graph partitioning problems, we 
generate random graphs whose adjacency matrices have given block sparsity patterns. We generate 
three uniform random matrices A, B E S" and C E R nx " and use two parameters a, /3 E [0, 1] to 
control the sparsity and form a sample graph adjacency matrix as: 

/ l jA iS >a} l{C«>/9} \ 

V l {C ij >f)} 1 {B«>«} J 

then randomly permute the matrix. An example is detailed in Figure Q] We then compare the 



•••••• ••••••• • 



• • • 
»• ••• 
»• • • 
• • •« 



> •••• 



Original adjacency matrix 



»••••• ••• 

• • •••••••••• 



Clustered permutation 



Figure 1: Graph partitioning example: original (randomly generated) adjacency matrix on the left, clustered 
permutation on the right obtained by solving problem d 1 5 b using the algorithms in SectionfJ] 

performance of our code with that of spectral clustering (see [ 17 1 for example) and show the results 
in Figure |2] below. We observe that both methods perform similarly well on clearly clustered data 
but that the NMF solution dominates spectral clustering as the graphs become closer to bipartite. 

4.3 Convergence speed 

In Figure |2] we plot number of iterations of the projected gradient algorithm versus matrix size n in 
randomly generated problems for various problem sizes. 




Figure 2: Left: Average number of iterations versus matrix size for the projected gradient algorithm. Right: 
Average performance versus graph connectivity for spectral clustering (squares) and the solution to the NMF 
problem a 15b using the algorithms in Sectionf3](circles, dashed lines at plus and minus one standard deviation). 
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4.4 Extensions 



While the results in the symmetric case seem to perform very well in numerical experiments, this 
is not the case for nonsymmetric problems where alternating projection methods using highly opti- 
mized interior point solvers are still at least an order of magnitude faster than our method. At this 
point, efficiently exploiting the NMF representation detailed here in the nonsymmetric case remains 
an open numerical problem. 
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